We will be using CSV files from the Census Bureau. The first two data sets contain information about public school enrollment, while the last four contain resident population estimates.
We will first read in and parse the data through separate steps, then repeat the process on a similar data set while converting those steps into functions so the whole workflow can be run with a single call. We will then call these functions to read in and parse the enrollment data sets, and create a new function to combine them. From the combined data, we will write generic functions for summarizing and plotting based on the class of the object. We will conclude by putting all of the new functions together and applying them to the resident population CSV files: read in the data, process it, combine it, summarize it, and finally plot it.
The first thing we are going to do is read in the first data set that we will analyze and later use to build our functions.
library(readr)
library(tidyverse)
sheet1 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv")
Now we are going to filter out the duplicate row and select the following
columns in order to trim the data we are analyzing:
* Area_name (which we will rename area_name)
* STCOU
* any column that ends in “D”
sheet1 <- sheet1 %>%
filter(Area_name != "District of Columbia") %>%
dplyr::select(Area_name, STCOU, ends_with("D")) %>%
rename(area_name = Area_name)
sheet1
We can see how our data is structured: the national-level statistics come first, then the state-level data, and lastly the county-level data.
Let’s convert the data into a long format using the
pivot_longer function, so that each row has one
area_name and a single enrollment value.
sheet1_long <- sheet1 %>%
pivot_longer(
cols = ends_with("D"),
names_to = "education_measurement",
values_to = "enrollment")
sheet1_long
We have now changed the data into a longer format where only one education measurement is in each row and its corresponding value is highlighted in the next column.
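To make the reshape concrete, here is a toy illustration; the column names and values below are hypothetical, in the style of this data set:

```r
library(tidyverse)

# Hypothetical wide tibble with two "D" columns in the style of the data
wide <- tibble(area_name = "ALABAMA", STCOU = "01000",
               EDU010187D = 1, EDU010188D = 2)
wide %>%
  pivot_longer(cols = ends_with("D"),
               names_to = "education_measurement",
               values_to = "enrollment")
# One row per "D" column: each row pairs a measurement name with its value
```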
Now we are going to parse the education_measurement
column: the survey type is the first seven characters (three letters
followed by four digits), and the two-digit year sits in the
third-to-last and second-to-last positions of the string. We will be
using the substr() function to do this. Lastly, we are
going to use the dplyr::select() and
everything() functions to reorder the columns to make them
easier to follow. This part is our personal preference.
# Now get the year from the education_measurement column
sheet1_long$year <- substr(sheet1_long$education_measurement,
start = nchar(sheet1_long$education_measurement)-2,
stop = nchar(sheet1_long$education_measurement)-1)
sheet1_long$year <- as.numeric(sheet1_long$year)
# Two-digit years above 25 are assumed to be in the 1900s; otherwise the 2000s
sheet1_long$year <- ifelse(sheet1_long$year > 25, 1900 + sheet1_long$year, 2000 + sheet1_long$year)
# Now get the first 3 letters and next 4 digits to see what type of survey was taken
sheet1_long$education_measurement <- substr(sheet1_long$education_measurement,
start = 1,
stop = 7)
# Order the data to get the year in front of the measurement
sheet1_long <- sheet1_long %>%
dplyr::select(area_name, STCOU, year, everything())
# Display this new data
sheet1_long
We can see that our data has the year and the type of survey (or
measurement of education) taken for each area. We want to separate the
data into two different tibbles: one at the county level, called
sheet1_county_data, and one for the non-county-level data,
called sheet1_not_county_data. We will use the
filter() and grepl() functions to find the
pattern “County Name, DD”: rows matching it become
sheet1_county_data, and rows that do not become
sheet1_not_county_data. We are also going to add a class
called “county” to sheet1_county_data and a class called
“state” to sheet1_not_county_data.
# Here we are going to make the county level data
sheet1_county_data <- sheet1_long %>%
filter(grepl(pattern = ", \\w\\w", area_name))
# Create the new class called county
class(sheet1_county_data) <- c("county", class(sheet1_county_data))
sheet1_county_data
# Here we are going to make the non-county level data
sheet1_not_county_data <- sheet1_long %>%
filter(!grepl(pattern = ", \\w\\w", area_name))
# Create the new class called state
class(sheet1_not_county_data) <- c("state", class(sheet1_not_county_data))
sheet1_not_county_data
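To see what the regular expression ", \\w\\w" does, here is a small check on a few area_name values in the style of the data:

```r
# The pattern matches a comma, a space, and two word characters
grepl(", \\w\\w", c("Autauga, AL", "ALABAMA", "UNITED STATES"))
# TRUE FALSE FALSE: only the county name contains a comma followed by
# a space and a two-letter state abbreviation
```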
After seeing how our two datasets look, we are going to add to the
sheet1_county_data tibble by creating a new variable called
state which shows what state each county is in. We are
going to use the substr() function to do this. Lastly, using the
dplyr::select() and everything() functions, we
are going to reorder the columns so they make sense.
# Get the state from each county and make it a new column
sheet1_county_data$state <- substr(sheet1_county_data$area_name,
start = nchar(sheet1_county_data$area_name)-1,
stop = nchar(sheet1_county_data$area_name))
# Reorder the columns with county and then state column following it
sheet1_county_data <- sheet1_county_data %>%
dplyr::select(area_name, state, everything())
# Show the updated tibble
sheet1_county_data
Having added the state to the sheet1_county_data tibble,
we now want to add the division of the states to
sheet1_not_county_data, in a variable we will call
division. The divisions are based on the Census
Bureau’s classification. This new variable will be created by using the
%in% operator. If the area_name does not correspond to a
US state, we will return “ERROR” so we know it is not one.
Lastly, using the dplyr::select() and
everything() functions, we are going to reorder the columns
so they make sense.
# Make all the divisions based on classification
division_1 <- toupper(c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont"))
division_2 <- toupper(c("New Jersey", "New York", "Pennsylvania"))
division_3 <- toupper(c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin"))
division_4 <- toupper(c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"))
division_5 <- toupper(c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia",
"District of Columbia", "West Virginia"))
division_6 <- toupper(c("Alabama", "Kentucky", "Mississippi", "Tennessee"))
division_7 <- toupper(c("Arkansas", "Louisiana", "Oklahoma", "Texas"))
division_8 <- toupper(c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming"))
division_9 <- toupper(c("Alaska", "California", "Hawaii", "Oregon", "Washington"))
# Create a new column called division to put the states in the correct spot. We will then sort the data so the division is next to the state
sheet1_not_county_data <- sheet1_not_county_data %>%
mutate(
division = ifelse(
area_name %in% division_1, "New England", ifelse(
area_name %in% division_2, "Middle Atlantic", ifelse(
area_name %in% division_3, "East North Central", ifelse(
area_name %in% division_4, "West North Central", ifelse(
area_name %in% division_5, "South Atlantic", ifelse(
area_name %in% division_6, "East South Central", ifelse(
area_name %in% division_7, "West South Central", ifelse(
area_name %in% division_8, "Mountain", ifelse(
area_name %in% division_9, "Pacific", "ERROR"
)
)
)
)
)
)
)
)
)
) %>%
dplyr::select(area_name, division, everything())
# Print the updated tibble
sheet1_not_county_data
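As an aside, dplyr::case_when() offers a flatter way to express the same nested-ifelse logic; a sketch using two of the divisions defined above:

```r
library(dplyr)

division_1 <- toupper(c("Connecticut", "Maine", "Massachusetts",
                        "New Hampshire", "Rhode Island", "Vermont"))
division_2 <- toupper(c("New Jersey", "New York", "Pennsylvania"))

# Each condition is checked in order; the TRUE clause is the fall-through
tibble(area_name = c("MAINE", "NEW YORK", "UNITED STATES")) %>%
  mutate(division = case_when(
    area_name %in% division_1 ~ "New England",
    area_name %in% division_2 ~ "Middle Atlantic",
    TRUE ~ "ERROR"))
# division: "New England", "Middle Atlantic", "ERROR"
```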
Now we want to repeat this process using functions that we will create. Before we do that, we will read in our next section of data to analyze. The code we are using is the same as in the previous section.
sheet2 <- read_csv("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv")
We are going to start making functions that perform all the steps we
carried out on our first data set. The first function, called
reshaping_data(), will dplyr::select the columns we need
and then pivot the data into the longer format, as in the previous
section. We are also going to remove the duplicate row for the
“District of Columbia”.
reshaping_data <- function(tibble, value_column = "enrollment") {
# Here we are selecting and renaming our desired columns
tibble <- tibble %>%
filter(Area_name != "District of Columbia") %>%
dplyr::select(Area_name, STCOU, ends_with("D")) %>%
rename(area_name = Area_name)
# Here we are changing our data into the long format
tibble_long <- tibble %>%
pivot_longer(
cols = ends_with("D"),
names_to = "education_measurement",
values_to = value_column)
# Now we are going to return the long formatted data
return(tibble_long)
}
We want to create another function, which will be called
find_year_and_survey(). It will parse the
education_measurement column: the survey type is the first
seven characters (three letters followed by four digits), and the
two-digit year sits in the third-to-last and second-to-last positions
of the string. We will be using the substr() function
within our new function to do this. Lastly, we are going to use the
dplyr::select() and everything() functions
within the function to reorder the columns to make them easier to
follow. This part is our personal preference, as before with the first
data set.
find_year_and_survey <- function(tibble_long) {
# Now get the year from the education_measurement column
tibble_long$year <- substr(tibble_long$education_measurement,
start = nchar(tibble_long$education_measurement)-2,
stop = nchar(tibble_long$education_measurement)-1)
tibble_long$year <- as.numeric(tibble_long$year)
# This allows us to make sure the survey was done in 1900s or 2000s
tibble_long$year <- ifelse(tibble_long$year > 25, 1900 + tibble_long$year, 2000 + tibble_long$year)
# Now get the first 3 letters and next 4 digits to see what type of survey was taken
tibble_long$education_measurement <- substr(tibble_long$education_measurement,
start = 1,
stop = 7)
# Order the data to get the year in front of the measurement
tibble_long <- tibble_long %>%
dplyr::select(area_name, STCOU, year, everything())
# Return this new data
return(tibble_long)
}
For the county-level tibble, we are going to create a function called
find_state() which will allow us to create a new variable
called state that describes which state one of these county
measurements corresponds to using substr() within our
function. We are also going to reorder the columns to put the new
state variable next to the area_name or county
of that state.
find_state <- function(tibble_county_data) {
# Get the state from each county and make it a new column
tibble_county_data$state <- substr(tibble_county_data$area_name,
start = nchar(tibble_county_data$area_name)-1,
stop = nchar(tibble_county_data$area_name))
# Reorder the columns with county and then state column following it
tibble_county_data <- tibble_county_data %>%
dplyr::select(area_name, state, everything())
# Show the updated tibble
return(tibble_county_data)
}
Now for the non-county-level tibble, we are going to create a
function called find_division() which will allow us to
create a new variable called division, which is based on
the US divisions defined by the Census
Bureau. This new variable will be created by using the
%in% operator. If the area_name does not correspond to a
US state, we will return “ERROR” so we know it is not one.
Lastly, using the dplyr::select() and
everything() functions within our function, we are going to
reorder the columns so they make sense.
find_division <- function(tibble_non_county_data) {
# Make all the divisions based on classification
division_1 <- toupper(c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Rhode Island", "Vermont"))
division_2 <- toupper(c("New Jersey", "New York", "Pennsylvania"))
division_3 <- toupper(c("Illinois", "Indiana", "Michigan", "Ohio", "Wisconsin"))
division_4 <- toupper(c("Iowa", "Kansas", "Minnesota", "Missouri", "Nebraska", "North Dakota", "South Dakota"))
division_5 <- toupper(c("Delaware", "Florida", "Georgia", "Maryland", "North Carolina", "South Carolina", "Virginia",
"District of Columbia", "West Virginia"))
division_6 <- toupper(c("Alabama", "Kentucky", "Mississippi", "Tennessee"))
division_7 <- toupper(c("Arkansas", "Louisiana", "Oklahoma", "Texas"))
division_8 <- toupper(c("Arizona", "Colorado", "Idaho", "Montana", "Nevada", "New Mexico", "Utah", "Wyoming"))
division_9 <- toupper(c("Alaska", "California", "Hawaii", "Oregon", "Washington"))
# Create a new column called division to put the states in the correct spot. We will then sort the data so the division is next to the state
tibble_non_county_data <- tibble_non_county_data %>%
mutate(division = ifelse(
area_name %in% division_1, "New England", ifelse(
area_name %in% division_2, "Middle Atlantic", ifelse(
area_name %in% division_3, "East North Central", ifelse(
area_name %in% division_4, "West North Central", ifelse(
area_name %in% division_5, "South Atlantic", ifelse(
area_name %in% division_6, "East South Central", ifelse(
area_name %in% division_7, "West South Central", ifelse(
area_name %in% division_8, "Mountain", ifelse(
area_name %in% division_9, "Pacific", "ERROR"
)
)
)
)
)
)
)
)
)
) %>%
dplyr::select(area_name, division, everything())
# Return the updated tibble
return(tibble_non_county_data)
}
Our next function will allow us to split the data into two different
tibbles, one based on county-level data and the other by
non-county-level data, similar to what we did with the first data
source. We are going to call this function
split_into_level_type(). We are also going to combine our
last two created functions with this one to return the two correctly
structured data sets we desire.
split_into_level_type <- function(tibble_long) {
# Here we are going to split our data into county level and non-county level data
# First split into county level data
tibble_county_data <- tibble_long %>%
filter(grepl(pattern = ", \\w\\w", area_name))
# Create the new class called county
class(tibble_county_data) <- c("county", class(tibble_county_data))
# Here we are going to make the non-county level data
tibble_non_county_data <- tibble_long %>%
filter(!grepl(pattern = ", \\w\\w", area_name))
# Create the new class called state
class(tibble_non_county_data) <- c("state", class(tibble_non_county_data))
# Now combine our county-level function from before to get the state
tibble_county_data_updated <- find_state(tibble_county_data)
# Now combine our non-county-level function from before to get the division
tibble_non_county_data_updated <- find_division(tibble_non_county_data)
# Return the two tibbles
return(list(county = tibble_county_data_updated, state = tibble_non_county_data_updated))
}
Now the last thing we are going to do is put all the functions into
one function call. This will allow us to call only one function to get
all the things we want with a URL we are reading in for our data. We are
going to call this function
wrapping_data_processing_functions().
wrapping_data_processing_functions <- function(url, var_name = "enrollment") {
# Read in the data through the URL
tibble <- read_csv(url)
# Now reshape the tibble with how we want it to look with columns selected and removing the duplicate row
tibble_long <- reshaping_data(tibble, value_column = var_name)
# Now we want to find the year and survey type of the data
tibble_long_updated <- find_year_and_survey(tibble_long)
# Lastly we want to split the data into county and non-county (state) level data and include necessary information
tibble_education_data_list <- split_into_level_type(tibble_long_updated)
# Return the final list of two tibbles
return(tibble_education_data_list)
}
Call wrapping_data_processing_functions twice to
read in and parse the two .csv files we have previously worked with,
storing the results in objects to be combined later.
test1 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv",
var_name = "enrollment")
test2 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv",
var_name = "enrollment")
Write a short function called combine_census that takes
the results of the two calls to our
wrapping_data_processing_functions and combines the
tibbles. That is, the two county level data sets get combined and the
two non-county level data sets get combined.
combine_census <- function(wrapper1, wrapper2) {
#use dplyr bind_rows to combine tibbles, here indexing on the returned list from the wrapper function for county
county_data_merge <- dplyr::bind_rows(wrapper1[["county"]], wrapper2[["county"]])
#index the list for the named element state
non_county_data_merge <- dplyr::bind_rows(wrapper1[["state"]], wrapper2[["state"]])
#return the two data sets as one object using the list function
return(list(county = county_data_merge, state = non_county_data_merge))
}
#call the combine_census function to combine the result of the two calls to the wrapper function
test3 <- combine_census(test1, test2)
test3
## $county
## # A tibble: 62,900 × 6
## area_name state STCOU year education_measurement enrollment
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Autauga, AL AL 01001 1987 EDU0101 6829
## 2 Autauga, AL AL 01001 1988 EDU0101 6900
## 3 Autauga, AL AL 01001 1989 EDU0101 6920
## 4 Autauga, AL AL 01001 1990 EDU0101 6847
## 5 Autauga, AL AL 01001 1991 EDU0101 7008
## 6 Autauga, AL AL 01001 1992 EDU0101 7137
## 7 Autauga, AL AL 01001 1993 EDU0101 7152
## 8 Autauga, AL AL 01001 1994 EDU0101 7381
## 9 Autauga, AL AL 01001 1995 EDU0101 7568
## 10 Autauga, AL AL 01001 1996 EDU0101 7834
## # ℹ 62,890 more rows
##
## $state
## # A tibble: 1,040 × 6
## area_name division STCOU year education_measurement enrollment
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 UNITED STATES ERROR 00000 1987 EDU0101 40024299
## 2 UNITED STATES ERROR 00000 1988 EDU0101 39967624
## 3 UNITED STATES ERROR 00000 1989 EDU0101 40317775
## 4 UNITED STATES ERROR 00000 1990 EDU0101 40737600
## 5 UNITED STATES ERROR 00000 1991 EDU0101 41385442
## 6 UNITED STATES ERROR 00000 1992 EDU0101 42088151
## 7 UNITED STATES ERROR 00000 1993 EDU0101 42724710
## 8 UNITED STATES ERROR 00000 1994 EDU0101 43369917
## 9 UNITED STATES ERROR 00000 1995 EDU0101 43993459
## 10 UNITED STATES ERROR 00000 1996 EDU0101 44715737
## # ℹ 1,030 more rows
Since we have our own classes for county and
state, we can write custom plot functions for them
by adding methods to the plot() generic. For the
state plotting method, we want to write a function that plots the mean
value of the enrollment statistic across the years for each
division.
#add the .state method to the plot function, adding an argument for which variable we want to calculate the mean for
plot.state <- function(df, var_name = "enrollment") {
new_df <- df %>%
#since we want the mean for each year and division, group by these variables
group_by(division, year) %>%
#use the summarise function to calculate the mean and the get function to reference the var_name
summarise(avg_enrollment = mean(get(var_name))) %>%
#use the filter function to remove the ERROR setting of the Division variable
filter(division != "ERROR")
#use ggplot to plot the numeric year value on the x-axis, and the mean of the statistic for each division and year on the y-axis
ggplot(new_df, aes(x = year, y = avg_enrollment, color = division)) +
geom_line()
}
For the county class, we want a similar plotting function but
with more flexibility. This function should allow the user to specify
the state of interest, choose whether the
top or bottom counties
should be looked at, and set how many of them will be
investigated.
#specify default arguments as enrollment, AL, top, and 5 for the arguments included for the function
plot.county <- function(df, var_name = "enrollment", filter_state = "AL", filter_county = "top", filter_value = 5) {
#map "top" to 1 and "bottom" to 0 using base r's ifelse; the stop function
#ends the call with a message if neither option is given. this multiplier is
#used below to set the sort direction for the statistic vector
filter_multiplier <- ifelse(filter_county == "top", 1,
ifelse(filter_county == "bottom", 0,
stop("Must select either 'top' or 'bottom' as options for filter_county; default is 'top'")))
new_df <- df %>%
#filter the county data to only include data from the state specified
filter(state == filter_state) %>%
#group by the area name
group_by(area_name) %>%
#use the summarise function to find the overall mean of the variable specified, using the get function here as well
summarise(avg_enrollment = mean(get(var_name))) %>%
#sort the summary statistic with the arrange function. when filter_multiplier
#is 1 ("top"), avg_enrollment * 1 is positive, so desc() sorts from largest to
#smallest; when it is 0 ("bottom"), the product is 0 and the default ascending
#sort applies. this relies on the statistic values being positive
arrange(if_else((avg_enrollment * filter_multiplier) > 0, desc(avg_enrollment), avg_enrollment)) %>%
#obtain the top or bottom x number of area_name from the previous step indexing from 1 to the filter_value given by the user or the default value
slice(1:filter_value)
#filter the data to only include the state specified by the user and the area_name's found in new_df (previous part)
filtered_df <- df %>%
filter((state == filter_state) & (area_name %in% new_df$area_name))
#use ggplot to plot the numeric year value on the x-axis, and the numeric variable specified under var_name on the y-axis, with a different color for each area_name
ggplot(filtered_df, aes(x = year, y = get(var_name), color = area_name)) +
geom_line()
}
Here we will combine all the newly created functions to do all of the
above in a few function calls. First we will run our data processing
function (wrapping_data_processing_functions) on the two
enrollment URLs used previously, specifying the name for the enrollment
column. We will then use our combining function
(combine_census) to put these into one object with two data
frames.
wrapper1 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01a.csv",
var_name = "enrollment")
wrapper2 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/EDU01b.csv",
var_name = "enrollment")
data_combine <- combine_census(wrapper1, wrapper2)
Use the plot function on the state data frame
#index using the named second list element, "state"
plot(data_combine[["state"]])
Use the plot function on the county data frame using
NC as the state, top being the group, and
10 as the number looked at
#index using the named second list element, "county"
plot(data_combine[["county"]], filter_state = "NC", filter_county = "top", filter_value = 10)
Use the plot function on the county data frame using
AZ as the state, bottom being the group, and
6 as the number looked at
plot(data_combine[["county"]], filter_state = "AZ", filter_county = "bottom", filter_value = 6)
Use the plot function on the county data frame using
only the default values for the arguments
plot(data_combine[["county"]])
Use the plot function on the county data frame using
OH as the state, top being the group, and
8 as the number looked at
plot(data_combine[["county"]], filter_state = "OH", filter_county = "top", filter_value = 8)
Let’s read in and combine a few more data sets using the
wrapping_data_processing_functions and
combine_census functions, applying them to the four
resident population data sets at the URLs given below. (We keep
enrollment as the value column name so the plot methods’
defaults still work, even though these values are population
estimates.)
wrapper3 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01a.csv",
var_name = "enrollment")
wrapper4 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01b.csv",
var_name = "enrollment")
wrapper5 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01c.csv",
var_name = "enrollment")
wrapper6 <- wrapping_data_processing_functions("https://www4.stat.ncsu.edu/~online/datasets/PST01d.csv",
var_name = "enrollment")
data_combine2 <- combine_census(wrapper3, wrapper4)
data_combine3 <- combine_census(data_combine2, wrapper5)
data_combine4 <- combine_census(data_combine3, wrapper6)
Use the plot function on the state data frame
plot(data_combine4[["state"]])
Use the plot function on the county data frame using
PA as the state, top being the group, and
5 as the number looked at
plot(data_combine4[["county"]], filter_state = "PA", filter_county = "top", filter_value = 5)
Use the plot function on the county data frame using
TX as the state, bottom being the group, and
12 as the number looked at
plot(data_combine4[["county"]], filter_state = "TX", filter_county = "bottom", filter_value = 12)
Use the plot function on the county data frame using
only the default values for the arguments
plot(data_combine4[["county"]])
Use the plot function on the county data frame using
NY as the state, top being the group, and
6 as the number looked at
plot(data_combine4[["county"]], filter_state = "NY", filter_county = "top", filter_value = 6)